M-A Delsuc - IGBMC - Université de Strasbourg produced with Quarto & reveal CC 4.0 BY
Model Data Analysis
Galileo Galilei - 1 -
Big change in the XVth-XVIth centuries: observation comes first!
Galileo, Pisa 1564 - Florence 1642
Galileo's observation of the Moon with the naked eye
(source Wikipedia)
Galileo Galilei - 2 -
One example
…
bells allow one to measure time
force is constant \(\Rightarrow\) acceleration is constant \(\Rightarrow\) speed grows linearly \(\Rightarrow\) distance grows quadratically
(source Wikipedia)
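The chain of reasoning above can be checked in a few lines. This is a minimal sketch with an arbitrary acceleration value: with constant acceleration the total distance grows as \(t^2\), so the distances covered between equally spaced bell ticks follow the odd numbers 1:3:5:7, which is what Galileo observed on his ramp.

```python
# with constant force, acceleration a is constant, so the distance
# travelled is d(t) = a * t**2 / 2 : it grows quadratically with time
a = 2.0                               # arbitrary constant acceleration
ticks = [0, 1, 2, 3, 4]               # equally spaced bell ticks
dist = [a * t**2 / 2 for t in ticks]  # total distance at each tick
# the distance covered between successive ticks follows the odd numbers
steps = [d2 - d1 for d1, d2 in zip(dist, dist[1:])]
print(steps)  # [1.0, 3.0, 5.0, 7.0]
```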
Galileo Galilei - 3 -
Well known example
(source Wikipedia)
van Leeuwenhoek - 1 -
van Leeuwenhoek, late XVIIth century,
invented the first microscope
Figure 1: …and discovered cells (sources Wikipedia)
today in biophysics
We need data
Data can be anything
A Dataset is a set of points in a multidimensional space.
numerical values
numerical values with error bars !
dates
texts
classification value
colors
yes/no
quantiles
etc…
However, the number of dimensions can be large!
large-dimension spaces are very unnatural to us
Law of large numbers
binomial law with 10 draws
with a large number of draws
With a large number of draws, every stochastic law becomes predictive
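The statement above can be illustrated with a short simulation (a sketch assuming NumPy): the empirical mean of binomial(10, 0.5) samples converges towards its expectation \(np = 5\) as the number of draws grows.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n_draws = 0.5, 10
# empirical mean of binomial(10, 0.5) samples, for growing sample sizes
for size in (10, 1_000, 100_000):
    samples = rng.binomial(n_draws, p, size=size)
    print(size, samples.mean())   # converges towards n_draws * p = 5
```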
large-dimension spaces are very unnatural to us
Law of truly large numbers:
FTICR-MS FID, 8 million points ~ 1 sec acquisition. I have thousands of these
zoomed
re-zoomed
even VERY unlikely events will occur
large-dimension spaces are very unnatural to us
1D random distribution
distance histogram
distance histogram
2D random distribution
distance histogram
distance to center histogram
All points are at the same distance!
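This concentration of distances is easy to reproduce numerically (a sketch assuming NumPy): for uniformly drawn points, the relative spread of the distance to the center shrinks as the dimension grows, so in high dimension all points sit at nearly the same distance.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
ratios = []
for dim in (1, 2, 1000):
    pts = rng.uniform(-1, 1, size=(n, dim))   # n random points in dimension dim
    d = np.linalg.norm(pts, axis=1)           # distance of each point to the center
    ratios.append(d.std() / d.mean())         # relative spread of the distances
    print(dim, ratios[-1])
# in high dimension the relative spread collapses:
# all points end up at (nearly) the same distance from the center
```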
large-dimension spaces are very unnatural to us
random matrices and their \(AA^t\) product
All random matrices are invertible!
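A quick numerical check of this (a sketch, not a proof): the Gram matrix \(AA^t\) of a Gaussian random matrix has, almost surely, only strictly positive singular values, and is therefore invertible, whatever the size.

```python
import numpy as np

rng = np.random.default_rng(1)
for n in (5, 50, 500):
    A = rng.standard_normal((n, n))
    G = A @ A.T                             # Gram matrix of a random matrix
    s = np.linalg.svd(G, compute_uv=False)  # its singular values
    print(n, s.min() > 0)                   # all strictly positive => invertible
```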
We need a model
Model can be anything
analogic model
example of Galileo's ramp
analytical equation: Newton's equations of motion, most of the physics we learn / we teach
image
the cell / microscopy
computer program
any kind of program modelling the system
molecular modelling
Ecological,
Climate,
…
modelling the measure
denoising / deblurring
Size matters
The measure \(y\) is described by some function \(T_f() \quad\) the “model”
\(y = T_f(x) + \epsilon\)
\(\epsilon\) is the “noise”
with \(y\), we measured \(N\) points
\(y\) is the “measure”
the model \(x\) contains \(P\) parameters
\(x\) is the “result”
\(N > P\) – classical case – “fit”
fit of the parameters onto the data
modelling the phenomenon
\(N = P\) – a special case – “transformation”
requires an estimate of the inverse of \(T_f()\)
modelling the measure
\(N < P\) – the inverse problem – “reconstruction”
requires a-priori knowledge
modelling the knowledge
\(N>P\) Fit
modelling the phenomenon
simple example:
Top, scattered measures with approximate linear dependence
Bottom, linear fit using \(\ell_2\) or \(\ell_1\) norms, in presence or absence of outliers
Algorithmic
We simply try to find the model \(x\) that minimizes the distance \(d()\) between the measure \(y\) and the modelled measure \(T_f(x)\): \(\hat{x}\) such that \[d( T_f(\hat{x}), y)\] is minimum.
The distance can be the (squared) Cartesian distance: \[d(a,b) = \sum_i (a_i - b_i)^2\] (also called the \(\ell_2\) norm) but can be any other norm,
for instance the \(\quad \ell_1\) norm:\(\ell_1(a,b) = \sum_i |a_i - b_i|\)
or the cosine pseudo-norm: \(\quad d_c(a,b) = \cos(\theta) = \frac{a \cdot b}{\Vert a\Vert \, \Vert b \Vert} \quad\) where \(\theta\) is the angle between \(a\) and \(b\) in the multidimensional vector space
this is more commonly used as the cosine similarity, equal to 1.0 when both vectors are proportional
or the nuclear (trace) norm \(\quad d_S(a,b) = \sum_i \sigma_i(a-b) \quad\) where \(\sigma_i(M)\) is the \(i\)th singular value of the matrix \(M\).
or anything else…
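These distances are straightforward to implement; here is a minimal sketch with NumPy (the function names are illustrative), checked on two proportional vectors:

```python
import numpy as np

def l1(a, b):
    "the l1 distance: sum of absolute differences"
    return np.sum(np.abs(a - b))

def l2(a, b):
    "the Cartesian (l2) distance"
    return np.sqrt(np.sum((a - b) ** 2))

def cosine_similarity(a, b):
    "cos(theta) between a and b; equals 1.0 when the vectors are proportional"
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])          # b = 2a, proportional to a
print(l1(a, b), l2(a, b), cosine_similarity(a, b))
```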
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import minimize

# define utilities
def costL1(sol):
    "cost function implementing the L1 norm"
    a, b = sol
    y = a * xdata + b
    return sum(abs(y - ydata))

def costL2(sol):
    "cost function implementing the L2 norm"
    a, b = sol
    y = a * xdata + b
    return np.sqrt(sum((y - ydata)**2))

def draw(sol, label=None):
    "draw the results"
    a, b = sol
    plt.plot(xdata, a * xdata + b, label=label)

# set-up scene
N = 20
xdata = np.linspace(-10, 10, N)
ydata = xdata + np.random.randn(N)
plt.figure()
plt.plot(xdata, ydata, 'o')   # first image

# do it
d = plt.subplot(111)
plt.plot(xdata, ydata, 'o')
plt.plot(xdata, xdata, ':', label='True')
ini = [0, 0]
resL1 = minimize(costL1, ini)
draw(resL1.x, '$L_1$')
resL2 = minimize(costL2, ini)
draw(resL2.x, '$L_2$')
plt.legend(loc=0)
d.set_xlim(-11, 11)
plt.show()
```
when the \(\ell_2\) (Cartesian) norm is used, the result is said to be the Maximum Likelihood solution or ML solution (le Maximum de vraisemblance)
the Fourier transform the most used and known transform
linear / non-local / invertible / orthogonal
Hadamard transform
related to Fourier transform on {-1, 1}
linear / non-local / invertible / orthogonal
wavelet transform
transform on local frequencies
linear / semi-local / invertible / non-orthogonal
Hilbert transform
related to Fourier transform
relates real and imaginary parts of an analytical signal
linear / local / invertible / orthogonal
Laplace transform
transform on real exponential function basis
linear / non-local / non-invertible
Transforms
All these transforms are linear
sum of \(T\) = \(T\) of sum
quantitative
some of these transforms are invertible (Fourier, Hilbert, wavelet, …)
They can thus be expressed as square matrices ( \(N = P\) ), possibly invertible
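For example, the linearity and invertibility of the Fourier transform can be verified numerically with NumPy's FFT:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(64)
b = rng.standard_normal(64)

# linearity: the transform of a sum is the sum of the transforms
print(np.allclose(np.fft.fft(a + b), np.fft.fft(a) + np.fft.fft(b)))  # True

# invertibility: the inverse transform recovers the original signal
print(np.allclose(np.fft.ifft(np.fft.fft(a)), a))                     # True
```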
\(N<P\) reconstruction
modelling knowledge
a-priori information
In many cases the measure is limited, and smaller than the system under study.
If we want to model the system, there are more degrees of freedom in the model \((P)\) than in the measure \((N)\). To handle this we need additional information, some kind of a-priori.
There are many cases and many ways to implement this a-priori
Different possibilities
using general principles
positivity, symmetry, …
sparsity, …
regularisation
using a program as an a-priori
parameterized modelling of the system:
molecular modelling
system biology
weather, …
Machine Learning
two different approaches:
Statistical modelling
eg: PCA, SVM, Random Forest,…
Deep Neural Networks
to be seen later…
Algorithmic Background for Machine Learning
General Approaches
classification
supervised (classification)
given examples along with “labels”, try to infer the labels from the data.
unsupervised (clustering)
given examples with no “labels”, try to define homogeneous classes.
dimension reduction
just draw (in 2D, possibly 3D) this multidimensional dataset
quantification
Regression
Interpolation
Extrapolation
Inversion
linear regression
linear relationship between measure and test
measure \(Y\)
Truth (estimated on a subset) \(X\)
relation (unknown) \(A\): \[X = A Y + \epsilon\]
find \(A'\) such that \(|| X - A'Y ||^2\)
is minimum (actually, any kind of meaningful distance will do)
optimisation problem
gradient methods
stochastic gradient
Newton's method
…
\(A\) matrix
allows one to predict \(X\) for additional measures
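The linear-regression step above can be sketched with ordinary least squares (the names A_true, A_est and the numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A_true = np.array([[2.0, -1.0],
                   [0.5,  3.0]])                 # the unknown relation
Y = rng.standard_normal((100, 2))                # 100 measures of 2 variables
X = Y @ A_true.T + 0.01 * rng.standard_normal((100, 2))   # X = A Y + noise

# least-squares estimate: minimises ||X - A'Y||^2
B, *_ = np.linalg.lstsq(Y, X, rcond=None)
A_est = B.T
print(np.round(A_est, 2))                        # close to A_true
```

Once \(A\) is estimated, `Y_new @ A_est.T` predicts \(X\) for additional measures.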
non-linear regression
non-linear relationship between measure and test
measure \(Y\)
Truth (estimated on a subset) \(X\)
relation (unknown) \(T()\): \[X = T(Y) + \epsilon\]
find \(T'\) such that \(|| X - T'(Y) ||^2\)
is minimum (actually, any kind of meaningful distance will do)
optimisation problem
gradient methods
stochastic gradient
Newton's method
…
\(T'\) operator
allows one to predict \(X\) for additional measures
Pre-Processing
Sometimes, the data needs to be pre-processed to get robust results
continuous data
To equalize different variables, it can be necessary to normalize each variable in mean and standard deviation \(\Rightarrow\) replace \(y_i\) by \(y'_i = \frac{y_i - \bar{y}}{\sigma_y}\)
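A minimal sketch of this standardization with NumPy (the helper name standardize is illustrative), applied to two variables on very different scales:

```python
import numpy as np

def standardize(y):
    "replace each variable y_i by (y_i - mean) / std"
    return (y - y.mean(axis=0)) / y.std(axis=0)

rng = np.random.default_rng(0)
# two variables on very different scales
data = np.column_stack([rng.normal(1000, 50, size=200),
                        rng.normal(0.1, 0.01, size=200)])
z = standardize(data)
print(z.mean(axis=0).round(6), z.std(axis=0).round(6))  # ~0 and 1 for each variable
```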
\(N < P\) – the inverse problem – “reconstruction”
requires a-priori knowledge
modelling the knowledge
Here — both \(N\) and \(P\) are huge
implicit data model (phenomenon) thanks to Neural Network structure
implicit measure space (measure) thanks to training set
implicit regularisation (knowledge) thanks to the convergence algorithm
what do we control ?
Quality control
cross validation
From a data-set, a model is built (learned) which allows one to predict or classify new data.
cross validation allows one to verify the quality of the model
use two data-sets: one for training / one for testing
measure the cost function / compute the cost on the test set
\(\Rightarrow\) test extrapolation/interpolation capabilities
Jackknife: use the whole data but remove one element from the dataset, then check the prediction for that element
do this for all elements!
or on a subset
…
Allows one to check quality
underfit
did not extract all possible information
overfit
interprets noise: very common / very dangerous
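The jackknife idea can be sketched on the linear-fit example above (a leave-one-out sketch, assuming NumPy; the data is simulated):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-10, 10, 20)
y = x + rng.standard_normal(20)                  # noisy linear data

# leave-one-out: refit without each point, then predict that point
errors = []
for i in range(len(x)):
    keep = np.arange(len(x)) != i
    a, b = np.polyfit(x[keep], y[keep], deg=1)   # linear fit on the remaining points
    errors.append(y[i] - (a * x[i] + b))         # prediction error on the left-out point
rms = np.sqrt(np.mean(np.square(errors)))
print(rms)                                       # close to the noise level (1.0 here)
```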
Other methods
There is a whole variety of methods, and new ones are proposed every week!
but they differ in
being used or not
efficiency
speed
robustness
accessibility
…
a little statistics
Confusion Matrix
Let’s assume we make a test, a PCR test for COVID-19 for instance. The test is said to be positive if the target sequence is expressed over a given threshold, negative otherwise.
However, there are fluctuations, (in the kit, in the sample, in the cycles, …)
so several possible outcomes:
|            | real T         | real F         |
|------------|----------------|----------------|
| **test T** | True Positive  | False Positive |
| **test F** | False Negative | True Negative  |
easy to define for \(N\)-valued tests…
Characteristic Values
Positive Predictive Value \[PPV = \frac {TP}{TP + FP}\]
Negative Predictive Value \[NPV = \frac {TN}{TN + FN}\]
sensitivity: how good is the detection of real positives \[\text{sensitivity} = \frac{TP}{TP+FN}\]
selectivity: how good is the rejection of real negatives \[\text{selectivity} = \frac{TN}{FP+TN}\]
Matthews Correlation Coefficient (MCC): how good is the test altogether \[MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}\]
False Discovery Rate (FDR): an important parameter, used to calibrate a method \[FDR = \frac{\text{False Positives}}{\text{All Positives}}\]\[FDR = \frac {FP}{TP + FP}\]\[FDR = 1-PPV\]
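All these characteristic values can be computed together from the four counts of the confusion matrix; a minimal sketch (the counts below are made up for illustration):

```python
def scores(TP, FP, FN, TN):
    "summary statistics of a 2x2 confusion matrix"
    PPV = TP / (TP + FP)                         # positive predictive value
    NPV = TN / (TN + FN)                         # negative predictive value
    sensitivity = TP / (TP + FN)
    selectivity = TN / (FP + TN)
    FDR = 1 - PPV                                # false discovery rate
    MCC = (TP * TN - FP * FN) / (
        (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)) ** 0.5
    return PPV, NPV, sensitivity, selectivity, FDR, MCC

print(scores(TP=90, FP=10, FN=5, TN=895))
```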
\(P(T/H)\): probability of a positive test: \(PPV\) \(\Rightarrow\) what we get
\(P(H/T)\): probability of a True Positive: \(TP\) \(\Rightarrow\) what we want
\(P(H)\): what I know about the system \(\Rightarrow\) a-priori
\(P(T)\): what I know about the test \(\Rightarrow\) a-priori
examples
AIDS test example
the probability that a Positive test is a FP depends on who the testee is: \[P(AIDS/Positive,context) = \frac{P(Positive/AIDS, context) \quad P(AIDS/context)} {P(Positive/context)}\]
a 76-year-old person, with a very limited social life
a 26-year-old person, with an active social and sexual life
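A sketch of this computation with Bayes' rule, using hypothetical test characteristics (99% sensitivity and specificity, illustrative figures) and two very different a-priori probabilities: with a tiny prior, almost all positives are false positives.

```python
def p_disease_given_positive(prior, sensitivity, specificity):
    "Bayes rule: P(disease | positive test)"
    p_positive = sensitivity * prior + (1 - specificity) * (1 - prior)
    return sensitivity * prior / p_positive

# same test, two very different a-priori probabilities
for prior in (0.0001, 0.05):
    print(prior, p_disease_given_positive(prior, 0.99, 0.99))
```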
terrorist example
which a-priori … ? \(\qquad P(Terrorist / context)\)
THE ETHICAL problem of AI
social filtering
pre-selection
not present in the training set
or worse: present in the training set (selection bias - biased training set)